Our sequence files are distributed in gzipped fastq format
Our files are named with the SRA run accession E?SRR000000.filt.fastq.gz. All the reads in the file also hold this name. The files with _1 and _2 in their names are associated with paired end sequencing runs. If there is also a file with no number it is name this represents the fragments where the other end failed qc. The .filt in the name represents the data in the file has been filtered after retrieval from the archive. This filtering process is described in a README.
You can find all the 1000 Genomes phase 3 BAM and fastq files in:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data
All BAM files from IGSR can be found in:
Many of our individuals have multiple fastq files. This is because many of our individual were sequenced using more than one run of a sequencing machine.
Each set of files named like ERR001268_1.filt.fastq.gz, ERR001268_2.filt.fastq.gz and ERR001268.filt.fastq.gz represent all the sequence from a sequencing run.
When a individual has many files with different run accessions (e.g ERR001268), this means it was sequenced multiple times. This can either be for the same experiment, some centres used multiplexing to have better control over their coverage levels for the low coverage sequencing, or because it was sequenced using different protocols or on different platforms.
For a full description of the sequencing conducted for the project please look at our sequence.index file
We distribute our fastq files for our paired end sequencing in 2 files, mate1 is found in a file labelled _1 and mate2 is found in the file labelled _2. The files which do not have a number in their name are singled ended reads, this can be for two reasons, some sequencing early in the project was singled ended also, as we filter our fastq files as described in our README if one of a pair of reads gets rejected the other read gets placed in the single file.